Christina's LIS Rant
Comps readings this week
Not terribly productive this week due to the holidays (and lots of cookies - both baked and eaten - and cupcakes, a pie... and shopping!).
Finished the Borgman book (see review).
Soergel, D. (2002). A framework for digital library research: Broadening the vision. D-Lib Magazine, 8(12). DOI: 10.1045/december2002-soergel
- great short piece, to the point. Written in 2002, why don't more DL people read and follow his stuff? (this means you, AIAA!)
- guiding principles: support research, scholarship, education and practice; go beyond the horseless carriage: "Some see DLs primarily as a means for accessing information, but in order to reach their full potential, DLs must go beyond that and support new ways of intellectual work"
- Themes
1) a DL is content + tools (providing access to the content alone is not sufficient)
2) DLs should have both individual and community spaces (yes!): "support users who work with materials and create their own individual or community information spaces through a process of selection, annotation, contribution, and collaboration."
3) DLs need semantic structure
4) linked data structures for navigation and search
5) powerful search
6) interfaces should guide users
7) DLs should have ready made tools to help users make use of the information contained
8) design should be informed by studies of user behavior (not revolutionary, but hey, we need to keep repeating this until people do it)
9) evaluation needs to take future functionality into account
10) legal/rights management issues should be addressed with new tech (this is the only thing that people are really working on or doing! sigh.)
11) sustainable business models
Soergel, D. (in press). Digital Libraries and Knowledge Organization. In S.R. Kruk and B. McDaniels, ed. Semantic Digital Libraries. New York: Springer.
These two Soergel pieces were added in response to criticism that I didn't have anything on "digital libraries". Good stuff. Particularly section 1.3, which lists advanced DL functions. Many of these are already provided by some DLs, but certainly anyone who runs a DL should look at this list to see what they could usefully add.
(ok, this has been bugging me and will hopefully get another post of its own, but library licenses to or DRM on databases and DLs do not support mashing-up or making new information in a workspace. Serious issue here when the DL provider won't give you the tools you need and then actively prevents you from doing what you want to do with DRM).
Started: Yin, R. K. (2003). Case study research: Design and methods (3rd ed.). Thousand Oaks, CA: Sage Publications.
Labels: comps
Review: Scholarship in the Digital Age
Borgman, C. L. (2007).
Scholarship in the digital age: Information, infrastructure, and the internet. Cambridge, MA: MIT Press.
This book is one of the many I'm reading for my comprehensive exams. Borgman does an excellent job distilling numerous research streams and large bodies of work into a well-written book. I highly recommend this book for librarians, computer scientists, domain scientists (and researchers in the humanities or social sciences), and anyone else who is interested in (or works to support) the present and future of scholarly work.
There were some overarching themes that she comes back to repeatedly over the course of the book:
- disciplinary or subdisciplinary differences in scholarly work are incredibly important but there are some common issues with scholarly communication and scholarly work that transcend these differences
- while the technology (the "e") has changed a lot, the underlying social and cultural practices have not changed that much
- scholarly publications, whether journal articles for scientists or books for humanities researchers, are well organized and findable. For a price anyone can read these and there are many people devoted to providing access. There is no similar framework for data, even though data is becoming an end product in some areas of research.
- scholarly infrastructure needs attention to support e-research - most of the funding and attention has gone to the technical building of the infrastructure and not the understanding of use, the policies, or the information organization. Funding for these repositories is not stable, either, so there really is no parallel or other function similar to what libraries and archives do (and libraries and archives are not really doing this for data)
- there's no link through the "value chain" of science. There needs to be a link from the data set to the journal article and vice versa - but this is difficult because things are murky and the lines are fuzzy
- scholarly publications perform these major tasks: legitimization; dissemination; and access, preservation, and curation. Changes to the scholarly system need to account for and still support these tasks, whether it's for data or for open access or for e-whatever. Repositories that don't certify and show priority, and whose contributions don't count toward tenure and promotion, will not gather a lot of submissions.
A few other nice bits:
- comparison of STS with information science and social informatics, and information systems
- definitions of information (Yes, there is more than Shannon!)
- the appreciation of the importance of the shift in the balance between public and private wrt informal scholarly communication.
- appreciation of the fact that scholars are not depositing their work in institutional repositories
- the norm of open sharing/posting of journal articles does not seem to correlate with the sharing of data (this is interesting and a little counter-intuitive: in physics, arXiv is the norm but sharing data isn't, while in biotechnology sharing data is the norm but sharing articles isn't)
- Europe's Database Directive - holy cow! I had no idea, how horrible!
- data has many different stages and levels - at which stage or when is it best to share
When (hopefully someday) I teach the reference in science and technology class, I'm going to assign Chapter 4 on scholarly communication. This was a review for me because I've done a lot of reading in the area, but it's a nice summary.
Unfortunately, some of the later chapters seem really redundant. Also, it doesn't seem like she really suggests any concrete ways forward. More research is needed, and she suggests a research agenda, but no way out. (I was hoping for one, though I don't see how we can do anything revolutionary here vs. evolutionary.)
Labels: comps
Comps readings this week
White, H.D. & McCain, K.W. (1989) Bibliometrics. Annual Review of Information Science and Technology 24, 119-186.
Good, but like all ARIST articles, long. Not sure it's as valuable as some of the other things in this area.
MacKenzie, D. A., & Wajcman, J. (1999). Introductory essay: The social shaping of technology. In D. A. MacKenzie, & J. Wajcman (Eds.), The social shaping of technology (2nd ed., pp. 3-27). Philadelphia: Open University Press.
- in my initial set of readings I emphasized science, but it does really make sense to also look at technology (so SSS became STS). This essay first looks at technological determinism - technologies change, then they have a one-way impact on society - as a theory of society and then as a theory of technology. The authors argue for some middle ground. Technology is important and can shape society, but technology can also be political and can be shaped by society. It can require certain social patterns or be "more compatible with some social relations than others" (p. 5). They go on to discuss the relationship of science to technology and more about economic and other ways society shapes technology. I should probably re-read this to make sure I've got the whole thing.
Winner, L. (1999). Do artifacts have politics? In D. A. MacKenzie, & J. Wajcman (Eds.), The social shaping of technology (2nd ed., pp. 28-40). Philadelphia: Open University Press.
- this is the standard article on this topic that everyone cites. It's not as clear or thorough as the article above, but is worthwhile. He also takes the middle ground: "rather than insist that we immediately reduce everything to the interplay of social forces, [technological politics] suggests that we pay attention to the characteristics of technical objects and the meaning of those characteristics" (so not technological determinism of society and not societal determinism of technology). Here's a nice quote:
to our accustomed way of thinking, technologies are seen as neutral tools that can be used well or poorly, for good, evil, or something in between. But we usually do not stop to inquire whether a given device might have been designed and built in such a way that it produces a set of consequences logically and temporally prior to any of its professed uses. (p. 32)
and "technologies are ways of building order in our world" - so the technologies a society adopts reflect and influence the social order/structure of that society. Otherwise, certain technologies might be "unavoidably linked to particular institutionalized patterns of power and authority" (p. 38).
Kazmer, M. M., & Xie, B. (2008). Qualitative interviewing in internet studies: Playing with the media, playing with the method. Information, Communication & Society, 11(2), 257-278. DOI: 10.1080/13691180801946333
- An excellent article. It provides a review of the literature but adds examples from their own research. It lays out the pluses and minuses of conducting interviews via IM, e-mail, and telephone or in person. The authors were apparently surprised to find that a participant posted the entire IM chat transcript to their web page... the next participant apparently read that transcript prior to the interview (which could be awkward if you weren't expecting it).
The first 5 chapters of:
Borgman, C. L. (2007). Scholarship in the digital age: Information, infrastructure, and the internet. Cambridge, MA: MIT Press.
(more on this in a future post)
Labels: comps
Initial thoughts on NodeXL
NodeXL is a project from Microsoft and collaborators to bring some decent social network analysis tools to the masses - that is, the masses who have access to Microsoft Office. It's not supposed to be the most powerful thing on the planet, but functional for people who are comfortable in Excel and aren't comfortable either programming or using one of the other SNA tools (primarily UCInet and Pajek).
My very quick take
1) still a PITA to install and run (go ahead and get .net 3.5 sp1 first, reboot, then install it)
2) good idea - a lot of potential
3) easy to make attractive pictures - a lot more difficult to actually get the picture to show the information I want. Specifically, I can't seem to label the nodes (I can't get it to put the info that's in column A into the secondary label column for love nor money, and I am Excel-functional), nor size them by a centrality measure (a very standard thing to do - see the igraph sketch after this list for what I mean)
4) very nice that it sucks Pajek .net files in without issues (this was an old network I had sitting around)
5) it's nice you can click on a vertex in the spreadsheet and it will be highlighted on the graph
6) it's not good that changing the colors of the nodes actually triggers it to rearrange them (grrr)
7) apparently you can use equations to change colors and stuff - I haven't tried that yet.
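For comparison, here's roughly what I was trying to get NodeXL to do (point 3 above), done in R/igraph instead - a minimal sketch; "mynetwork.net" is a placeholder for whatever Pajek file you have lying around, and degree is just one choice of centrality:
library(igraph)
g <- read.graph("mynetwork.net", format = "pajek")     # same kind of Pajek .net file NodeXL imports
# Pajek imports usually put vertex names in "id" (or "name", depending on igraph version)
V(g)$label <- if (is.null(V(g)$name)) V(g)$id else V(g)$name
cent <- degree(g)                                      # or betweenness(g), closeness(g), ...
V(g)$size <- 3 + 12 * cent / max(cent)                 # scale sizes into a plottable range
plot(g, layout = layout.fruchterman.reingold)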
Labels: sna
Comps readings this week
Finished:
Downey, G. L. (1998). The machine in me: An anthropologist sits among computer engineers. New York: Routledge.
Ok, first of all, these aren't "computer engineers" - these were mechanical and aerospace engineers who were dealing with CAD/CAM programs, either in the workplace (finding, selecting, implementing, bitching), or writing them, or learning about them as senior undergrads or grad students.
Anyway, kind of an interesting book about the illusion of control and the idea that automation would lead to productivity which would increase America's competitiveness (more to it, of course). I learned more about the development of this stuff and also some about former and current computer companies.
Labels: comps
IEEE eScience: WOOL: A Workflow Programming Language
Speaker Geoff Hulette
workflows
(oh thank goodness, some definitions!)
scientists want to focus on domain problems but use computer models so must deal with programming
workflows
- separation of coordination from computation
- data flow naturally expresses parallelism
activities- primitive computations
- any number of inputs and outputs
- connections connect activity inputs and outputs
workflow is a collection of activities and connections
science – Triana, Taverna, VisTrails, Kepler
business – bpel4ws, wsfl
mac – quartz composer
abstract – agwl
another language – WOOL:
- simple workflow for neuroinformatics and image processing
- other systems don’t scale down easily wrt complexity
abstract – what not how
- runtime neutral
- same workflow representation on a laptop, cluster, or cloud
- text based with a simple syntax (so you can use text tools like Subversion)
- rich type system
- hierarchical and composable workflows
Labels: IEEEeScience08
IEEE eScience: Final Keynote
Edward Seidel, Director, Office of Cyberinfrastructure, NSF (since the summer, I guess)
he’s a physicist and a computer scientist, formerly of LSU
how he got here – started in HEP then moved to general relativity-
- a series of problems that require supercomputers, computer scientists, high speed nets, grids, visualization
black hole perturbation theory – very hard, try supercomputers
black hole collisions
neutron star collisions
themes –
costs a lot, requires collaboration, idea to reuse tools..
coastal modeling – an example for cyberinfrastructure
another example LHC – quantity of data
data driven era of science
NSF vision (see Atkins report 2003 – Cyberinfrastructure Vision for the 21st century)
1. virtual organizations for distributed communities (large-scale teams to solve complex problems)
2. HPC
3. data visualization/interaction
4. learning and workforce
Data Net
- developing communities and tools to solve complex problems – example: climate change, overlaying chemistry, environmental data, etc.
Virtual organizations for distributed communities
learning and workforce development
- need computational science (not cs but broader?) programs
Teragrid
- track 2 (Texas, Tenn, Pittsburgh… another under review)
- track 1 petascale at the University of Illinois – but will only serve a small number of scientists
next generation – xd – xtreme digital (still looking at this)
- innovative ways to support digital services, has explicit visualization component
open science grid and loosely coupled science grids
- integrate national or international needs into campus needs/support
- some of this is aimed at the LHC, but there are other efforts that can use it, too
but what about software?
- no real program at NSF to build the software to take advantage of these nets
applications to take advantage of this
virtual organizations to work out how to do all of this
Blue Waters
IBM Power 7 based system
online 2011
1 petaflop sustained performance on real applications
>200k cores
idea is not a whole bunch of jobs – but a few jobs that need this kind of complexity
coupled dynamic ensemble simulations, real-time simulations – real policy issues like scheduling
will keep going with the international programs
translation to programs – get bandwidth to the center but not to the labs where it's needed
federated id management
what next?
pick up and do remainder of things in vision
need end-to-end integration
need to support a computational science community
he brings up the third pillar, too: experiment, theory, computational science
Labels: IEEEeScience08
IEEE eScience: Science in the Clouds
Alexander Szalay, Science in the Cloud
movement over time first towards pcs and small scale and then back through Beowulf clusters, the grid, and now clouds
paradigm that the data was in one location and had to be moved and cleaned up afterward wasn’t efficient, so now we’re looking at distributed data
20% of the world's servers are going to the big 5 (clouds?)
Clouds – have some great benefits, the university model really doesn’t work anymore
clear management advantages
base line cost that is hard to match
easy to grow dynamically
science problems:
- tightly coupled parallelism (low latency, MPI – Message Passing Interface)
- very high I/O bandwidth
- geographically separate data (can’t keep it in one place, in synch…)
clouds are fractals – exist on all scales
little, big, all science can use the clouds
need a power law distribution of computing facilities for scientific computing
Astro trends -
Cosmic Microwave Background (COBE, 1990, 1000pixels; WMAP, 2003, 1Mpixels; Planck, 2008, 10Mpixels)
Galaxy Redshift Surveys (1986: 3,500 galaxies; 2005: 750,000)
Time domain is new
Lsstars (?) - Petabytes
LSST – total dataset in the 100 Petabytes
scientific data doubles every year – successive generations of inexpensive sensors (new CMOS) and exponentially faster computing – this changes the nature of scientific computing across all areas of science
- but it’s harder to extract knowledge
data on all scales – KB in Excel spreadsheets manually maintained vs. multi-terabyte archive facilities (all medical labs are gathering digital data all the time)
industrial revolution in collecting scientific data
acquire data (doubling)…. bottom of the funnel, publication, only growing at 6%
challenges:
access – move analysis to the data
discovery – typically discovery is done at the edges so more doesn’t give us much more… but opening up other orthogonal dimensions gives us more discoveries. federation still requires data movement
analysis – only max N log N algorithms are possible
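(A back-of-envelope on why, my arithmetic not his: for a catalog of $N \approx 10^{9}$ objects,
\[
N\log_2 N \approx 10^{9}\times 30 = 3\times10^{10} \qquad\text{vs.}\qquad N^{2} = 10^{18},
\]
so anything much worse than N log N simply never finishes at survey scale.)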
data analysis-
on all scales across multiple locations
assumption has been that there’s one optimal solution, that we just need a large enough data set… with unlimited computing power
but this isn’t true any more, randomized incremental algorithms
analysis of scientific data
- working with Jim Gray – coping with the data explosion starting in astro with the SDSS… first data release from Sloan was 100GB, now the dataset is about 100TB
- interactions with every step of the scientific process
Jim Gray:
- scientific computing revolving around data – take analysis to data, need scale-out solution for analysis
- scientists give the database designer top 20 questions in English, and the database designer or computer scientist can design the database accordingly
- build a little something that works today, and then build bigger /scale and build something working tomorrow – go from working to working – build what the world looks like today, not for tomorrow
Projects at JHU
SDSS – finished, final data release has happened
- work has changed
- final archiving in progress (UChicago library, JHU library, CAS mirrors at FNAL + JHU Physics & Astro)
archive will contain >100 TB
- all raw data
- all processed calibrated data
- all versions of the database
- full email archive (capture technical changes not in official drawings) and technical drawings
- full software code repository
- telescope sensor stream, IR fisheye camera, etc.
Public use of the skyserver
- prototype in data publishing
- 500 million web hits in 6 years, 1M distinct users but only 15k astronomers in the world
- 50,000 lectures to high schools
- delivered >100B rows of data
interactive workbench
- sign up, own database, run query, pipe to own database… analysis tools to transfer only plot and not entire database over the wires
- 2,400 power users
GalaxyZoo
built on SkyServer, 27M visual galaxy classifications by the public
Dutch school teacher discovery
Virtual Observatory
collaboration of 20 groups
15 countries – international virtual observatory alliance
interfaces were different, no central registry, underlying basic data formats were agreed upon
sociological barriers are much more difficult than technical challenges
Technology
- petabytes
- save, move, some processing near the instrument (for example) in Chile
- funding organizations have to understand the computing costs over time
- open ended modular system
- need Journal for Data (overlay to bridge the gap so that data sets don’t get lost) – curation is key, who does the long-term curation of data
Pan-STARRS
- detect killer asteroids
- >1Petabyte/year
- 80TB SQL Server database built at JHU – largest astro db in the world
Life Under Your Feet (http://lifeunderyourfeet.org/en/default.asp)
- role of soil in global change
- a few hundred wireless computers with 10 sensors each, long-term continuous data, complex database of sensor data, built from the SkyServer
once the project is online, linear growth, exponential growth comes from new technologies new cameras … future might come from individual amateur astronomers using 20MB cameras on their telescopes and systematically gathering data
more growth coming from simulations (software is also an instrument)
(example of one that was so big the tape robot was inaccessible so the data is never used)
also need interactive, immersive usages (like for turbulence)
- store every time slice in the database
- turbulence.pha.jhu.edu (try it today!)
commonalities
- huge amounts of data, need aggregates but also access to raw data
- requests enormously benefit from indexing
Amdahl’s Laws for a balanced system – we’ve gone farther and farther from these
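(For reference, the rule of thumb he's invoking, as I understand it from the Gray/Szalay "Amdahl's laws" argument, is that a balanced system keeps I/O and compute in step:
\[
\text{Amdahl number} = \frac{\text{bits of I/O per second}}{\text{instructions per second}} \approx 1,
\]
and commodity systems have drifted well below 1 - lots of FLOPS, starved for I/O.)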
Comparisons of simulations and their data generation vs. what’s available in the computers
data analysis is maxing out the hardware because of the 10-100 TB – no one can really do anything with over 50 TB
IO limitations for analysis
we’re a factor of 500 off of what we can do to get to a 200TFlop Amdahl machine
they built a high io system using cheap components
the large datasets are here, the solutions are not – systems are choking on IO
scientists are cheap
data collection is separated from data analysis
(big experiments just collect data and store it, scientists come along later and analyze the data- decoupled)
How do users interact with petabytes
- can't wait 2 weeks to do a SQL query on a petabyte
- python crawlers
- partition queries
- MapReduce/Hadoop – but can’t do (or are very difficult) complex joins you need to do for data analysis
William Gibson – “The future is already here. It’s just not very evenly distributed”
data cloud vs. HPC or HTC
journal for data?
- with ApJ?
- example: postdoc writes a paper, only a table goes into the supplementary data, the journal can't take the terabytes of real data - need another archive for this that's linked to the science article.
Labels: IEEEeScience08
IEEE eScience: Sensor Metadata Management and Its Application in Collaborative Environmental Research
Sensor Metadata Management and Its Application in Collaborative Environmental Research
Speaker Sebastian Michel
SwissEx – Swiss Experiment
www.swiss-experiment.ch
provision of web-based tech, wireless, etc., for use in environmental science
experimental areas overlap – so it would be good for researchers to share data
sensormap and gsn access
uses semantic media wiki (really?)
sensor data > {static data, metadata, manual measurements, streaming data} > {wiki, semantic wikis, GSN} > {sensor map, GSN interface}
gsn? (maybe: http://gsn.sourceforge.net/)
data
sensor provided data, static experiment metadata, static sensor metadata, dynamic sensor information (available, broken), dynamic data quality info
semantic media wiki lets you enter rdf triples, but the format is too much work (particularly on a hard day up the mountain wrestling the instrument into place)
they use forms – html table – more user friendly
(q: wonder if they tried the halo plugin?)
(q: wonder if they could automate information about the equipment)
SensorMap from Microsoft Research – put the sensor on virtual earth and can get the data by clicking
question from audience – use relational database?
use OGC with built in (?)
my question – about halo – no he didn’t try it.
IEEE eScience: ARCHER: An Enabler of Research Data Management
ARCHER – came in late
Speaker: Anthony Beitz
research data management
life cycle
conceive, design, experiment, analyze, collaborate, publish, expose
(this seems strange to put collaborate there)
theirs helps from experiment, analyze, publish
includes
- research repository – data curation, with rich metadata
- concurrent data capture and telemetry
- dataset manager including (web and desktop client for large datasets so no time out), metadata editing tool
- collaborative and adaptable research portal dev environment
currently available www.archer.edu.au
uses plone? xdms – also does automatic metadata extraction upon deposit
core metadata based on STFC’s scientific metadata model
flexible metadata for samples, datasets, and datafiles
stfc project > experiment > dataset > data file
for experiment: publication, keywords, topic list, investigator, sample
DIMSIM – distributed integrated multi-sensor & instrument middleware
- convenient for large data in bulk from instrument to repository
- enables concurrent analysis
XDMS – scientific dataset manager
web tool for researchers to manage and curate their research process
- formalized research data management
- automated metadata extraction
- persistent identifiers for each dataset (I guess at the whole dataset level – using handle technology)
- powerful search capabilities ( he didn’t say much here)
(q: tell more about search – just metadata?)
- secure
- publish from there into their institutional repository
- currently customized to crystallography
Metadata editor validates schema
Hermes is the desktop tool
- doesn’t have data time out problems
Labels: IEEEeScience08
IEEE eScience: Web services architecture for visualization
Web services architecture for visualization
Jeremy Walton, Numerical Algorithms Group
convert data to information leading to knowledge
visualization and statistical analysis – visual analytics
ADVISE system–
- visual analytics
- enables creation of analysis application (generic or domain specific)
- service oriented architecture
2 existing systems
IRIS explorer – visualization algorithms
GenStat – statistical algorithms
merge these > advise > clients
3 layers visualization components > web services middleware > user interface
modules connected in pipelines
(modules from the older systems, plus some new ones)
middleware – stateful service so that data in calls can be re-used, maintains a model of pipeline state..
user interface – in java, can use pre-built pipelines or build your own….
(he demoed one where you could pick off the modules and build it – and another with dropdown boxes and a go button)
future work-
real user data sets
intelligent data types (streaming data – that’s cool but does it come packaged?, sharing large data)
more fine grained visualization services
connecting to other web services
Labels: IEEEeScience08
IEEE eScience: Dan Reed's Keynote
Keynote:
Daniel A. Reed
Microsoft Research
Cloud Seeding: Watering Research Flowers
(slides will not be shared per his request)
3 pillars of discovery: theory, experiment, and computational e-Science (ok, so observational sciences don’t provide new discoveries?)
(ew viscous flow in disposable diapers)
truisms- bulk computing is almost free but software and power aren’t
ubiquitous sensors – but lagging in data fusion
moving lots of data is still hard
people are expensive – robust software is extremely labor intensive
scientific challenges are complex and social engineering is not our forte – increasingly social engineering is the limiting factor in our success (pesky users)
political/technical approaches must change or we risk solving irrelevant problems
how do we develop effective programming tools for the average jane who is using at best Perl, MATLAB, etc…. typical scientist or engineer vs. savvy computer scientist
social implications of the data deluge
- hypothesis driven (data was pretty scarce in the past)
- exploratory –“what correlations can I glean from everyone’s data?”
- this requires different tools and techniques
- massive multidisciplinary data rising rapidly at an unprecedented scale
clash of cosmologies – article that astro was going from observation to mere measurement
scientific computing is along for the ride instead of in the driver's seat – we use GPUs because the commercial industry is driving innovation in the processors and the average desktop machine doesn't really need to be any faster for typical office or home tasks… so we have to use these tools developed for the commercial market
use these commercial cores and parallelism – new approaches are needed to really take advantage of the parallelism
cloud application frameworks (slide from Dennis Gannon)
OS virtualization > software as a service > parallel frameworks (this is a triangle)
Hadoop over EC2 is 2/3 of the way toward OS virt from parallel
GFS, BigTable,MapReduce at parallel frameworks
Amazon S3/EC2 at OS virt
Microsoft mesh at Software as a service
Microsoft Azure Services Platform- cloud services
services platform (hosted sharepoint, CRM, SQL, .NET….)
(I was hearing about software as a service ages ago so I guess now it’s becoming more mainstream)
Data Center costs for cloud computing- (physical plant and power issues are huge)
land 2%
core and shell costs 9%
architectural 7%
mechanical and electrical – rest
energy and supporting infrastructure cost like 4x the cost of the 1u server
researchers have different perceptions about the cost of ownership because they don't see the power bills, only the server and people costs
15MW is what you need to build a cloud computing facility (15 yr amortization)
servers $2k each, 50,000 of them
commercial power $0.07/kWh
security/administration 15 people @ $100k/yr
$3M/month related to power
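A back-of-envelope on how those numbers hang together (my arithmetic, not his; I'm assuming the ~$3M/month "related to power" figure folds in cooling and power-distribution infrastructure on top of the raw electricity bill):
# rough monthly costs for the 15MW facility sketched above (figures from the talk, combined by me)
servers <- 50000; server_cost <- 2000            # ~$100M of servers up front
power_mw <- 15; kwh_price <- 0.07                # commercial power rate, $/kWh
staff <- 15; salary <- 100000                    # security/administration
server_capex      <- servers * server_cost                        # 1e8
electricity_month <- power_mw * 1000 * 24 * 30 * kwh_price        # ~756,000 - raw electricity only
staff_month       <- staff * salary / 12                          # ~125,000
# the quoted ~$3M/month is about 4x the raw electricity bill, so presumably
# includes cooling, distribution, and amortized infrastructure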
instruments and infrastructure
- from desktop to lab level to organization level maybe state level then national level and regional level
building blocks of cyberinfrastructure – this is his slide from 10 years ago
in the past 10 years
- commodity clusters – proliferation of inexpensive hardware, race for MachoFLOPS, broad base for enabling software, low-level programming changes
- grids and distributed services – multidisciplinary collaborations, less broad base for enabling software
research money vs. production, maintenance, on going reliable work
teraflop is no news now, done in the lab with linux clusters
security – PII, HIPAA – research machines still have to be secure and patched and this is a real cost (not just an inconvenience)
business
- capital is cheap, labor is expensive, costs are explicit
academia, govt
- capital is hard, labor is seeming cheap (students!), costs are implicit
funding is at best flat
infrastructure inefficiency reduces research funding – need to become more efficient
so this is his argument to go to cloud computing
- elasticity
- economies of scale
- efficiency
- cost clarity
- pay as you go
- support
- geodistribution (security)
Labels: IEEEeScience08
IEEE eScience: Experiences from Cyberinfrastructure Development for Multiuser Remote Instrumentation
Experiences from Cyberinfrastructure Development for Multiuser Remote Instrumentation
Presenter Prasad Calyam, Ohio Supercomputer Center
remote instruments
- v.v. expensive in startup and maintenance so let other people use yours, use someone elses
- becoming a requirement from funding agencies
- better ROI on cost
- access via internet
pilot to leverage investments in networking hpc and scientific instruments – paid for by regents of university system of Ohio
SEM, raman spec, telescopes, accelerator, nmr spec
remote user site
- remote observation, operation, voice/text chat, lab notebook
osc
- portal dev, data storage, analytics, security (network and data)
instrument lab
- resource scheduling, billing, use policy, sample handling
Challenges
- last-mile network bottlenecks
- communications – remote user/operator communications, multi-device views for user workflows – multiple remote users, or single remote users – not useful off the shelf
- dead man's switch – if the operator becomes incapacitated (or bad things happen at the instrument while it's being operated remotely)
- security for the data and network
Policy challenges
- prioritizing
- licensing
- SLA with vendors
- safe-use policy expert vs. novice use policies
- billing
Case studies
they build custom interfaces – the COTS ones they tried had some wonky problems
RICE features
network aware video encoding
control blocking – lock passing – only one operator at a time so they don’t step on each other
web-portal features
user account
management of instruments, people, data
access control
chat
storage of experiment data
more reasons
- remote participants can watch an expert control a scientific instrument – efficiently and reliably (network awareness mitigates instrument damage)
- expert can pass control to remote user (for training)
future directions
- wikis, lab notebooks, other communication tools
- human-centered remote control so you get the "at-the-instrument" experience
from audience: group working on remote instruments, etc., please join, etc.
Remote instrumentation services in grid environment
OGF RISGE-RG
forge.gridforum.org/sf/projects/risge-rg
Labels: IEEEeScience08
IEEE eScience: Classification of Different Approaches for e-Science Applications
Unfortunately not as high-level an overview as I was hoping for - it was basically a typology of requests for and uses of the European grids - so one level of e-science, for sure.
Classification of Different Approaches for e-Science Applications
what is e-science
sits on top of
theory & models + computational techniques + experiment
grids
HPC grids and then others that are HPG (high throughput grid) – national and then European
how do scientists get computational time (similar to a telescope or other big science): proposal, granted, scheduled...
5 paradigms on deisa
1: simple scripts & control
- write in emacs or whatever
- use unicore client with the scripts - grid middleware does this
2: Scientific Application plug-ins
3: Complex workflow
- example QSAR
4: Interactive access
- more frequently physicists – ssh connection to the machine – to check results during computations
- computational steering – changing parameters on the fly
- collaborative online visualization and computational steering (COVS)
5: Interoperability
- using multiple grids
- grids are 6x and 4x overbooked
- can’t use the same middleware for both of their grids
Labels: IEEEeScience08
IEEE eScience: MyExperiment
MyExperiment.org
myExperiment: Defining the Social Virtual Research Environment
David De Roure, University of Southampton
The point of this talk is to define what this thing is.
social process of science
- more to share
data, metadata, provenance, workflows, ontologies
these are the things that myexperiment is meant to help people share
workflow systems includes trident, taverna, kepler, triana, Ptolemy II
automate work to make reproducible and shareable
these workflows are difficult to build so it’s good to start with something – share
also they capture the context
it’s like sharing matlab scripts or models… not really sharing data, although you can do that, too.
example workflow for identifying biological pathways implicated in resistance to x in cattle
Jo, working on whipworm in mouse, meets Paul, uses his workflow without changing it, and finds the biological pathways for sex dependence in mice.
myExperiment is:
community social network
fine control over sharing
federated repository
gateway to other publishing environments
platform for launching workflows
started 3/2007, open beta 11/2007 – 1331 users, 536 workflows
workflow – social metadata including
license
credits
attributions
tags
uploader
(q: how do scientists search for and find relevant workflows? - he says there's a paper on their wiki)
groups – sometimes more formal to control access to information
metrics are complicated – what do downloads or visitors mean?
but more content is good
more content – harder to find – but more people, so more social and use information to help you find it (can add in recommender systems)
other “social web sites for scientists” are really just linkedin but with different people – they’re trying to differentiate themselves.
credits & attributions (critical)
fine control over visibility and sharing
packs (make a collection)
federation
enactment (actually run workflows)
Packs –
obvious but really necessary – to make a shopping basket (ppt slide from a workflow or collection of workflows)
collect internal things or link to external things
share, tag, discover, and discuss
enactor is complicated – if workflow has to be in the same place as the data, or if it doesn’t then…
It’s open source, and it also has RESTful sources, Google gadgets
Taverna Plugin
export packs as OAI-ORE (they would like someone to consume this – they’ve produced it, but would like to see repositories using it)
there’s also SPARQL searching
using SIOC: semantically-interlinked online communities
software design for empowering scientists (their article – next issue IEEE software)
six principles of software design to empower scientists
1 fit in, don’t force change
2 jam today and more jam tomorrow
3 just in time and just enough
4 act local, think global
5 enable users to add value
6 design for network effects
the process
discovery > acquisition > prep/cleaning > transformation > interpretations > dissemination > curation
across all of these, provenance… and two other things that I didn’t type fast enough and have forgotten
can they make an e-laboratory? (it’s the office of the future :) ) – can there be an ecosystem of cooperating e-laboratories
what are the research objects?
new project BioCatalogue – a catalog of biosciences web services (http://www.biocatalogue.org/)
What is a virtual research environment?
facilitate mgmt and sharing of research objects (whatever they are)
support the social model?
open extensible environment
platform to action research – to deliver research objects to remote services and software.
future – sharing R, matlab, statistical models
question: is the community big enough for reputation?
answer: well, primary reputation comes through publication – but as the site grows this is developing: favoriting, annotations.
Labels: IEEEeScience08
IEEE eScience: My Slides
Slightly more information in an earlier post.
Paper available only upon request until I figure out the rules (just e-mail me) - it will probably be in Xplore in a couple of months.
Labels: IEEEeScience08, science blogging, sna
IEEE eScience: Acoustic Environment Observatory
Toward an Acoustic Environment Observatory
Authors from QUT
(Paul Roe from Microsoft QUT eResearch Centre speaking)
Ecosystem processes –
standard sensors monitor cycling of water and nutrients, energy transfers – but you need something different to look at biodiversity and species interactions with the environment – they're using acoustics
scale up from traditional processes of monitoring with eyes and ears in the field (and tape recorders)
what to study with acoustics?
- detect
- behavior
- examples climate change (detect migratory species), invasive species, indicator species
- (they're doing terrestrial; much has already been done with underwater)
trying to do this at scale
- cheap sensors, automatic processing
- system is always evolving (data errors, broken sensors)
difference between sensing the temperature 1x/hour and sending it, vs. continuous collection (kind of like reverse podcasting). Very different from something like home security surveillance testing
data > middleware to manage sensors > signal labeling manual/semi-auto/auto (SciBot)> semantics, tag meaning and inference (behaviors and annotations)
they use solar power and cell phones with special software
web interface for scientists to schedule recordings, phones use 3G to upload the data
power is limiting – use too much or go too long without the sun – also uploading takes longer in bad weather
signal processing
- manual
- template based like speech recognition
- information theoretic approach to find signal
analysis and visualization
- point data this bird, this call, this time and place -- weave together to understand more about environmental health
useful in some environmental impact studies, tracking this rare and hard to see bird
also, why do koalas bellow
can listen: www.mquter.qut.edu.au/sensor
Labels: IEEEeScience08
IEEE eScience: Opening keynote
really bare bones notes
Rich Wolski, Keynote Speaker
UCSB
Building Science Clouds Using commodity, Open-Source Software Components
Start with what’s happening in the commercial clouds and work toward the scientific world
A lot of people getting excited in distributed computing – commercial entities going ahead, both big and small
virtualization < web services < SLAs – users : need to automate as much as possible, also need to make it clear to users what they are getting when they rent space in the cloud
public cloud vs. private cloud
What can be done in a cloud? – workflows are great, but there are things that don’t work well in grids
what extensions or mods are required for scientific applications, data assimilation, multiplayer gaming (latency constraints)
How do clouds interact with other systems – like mobile devices which are already in/on their own cloud
open source cloud: simple, extensible, widely available and popular technologies, easy to install and maintain
examples: U Chicago's Nimbus, on GT4 and Globus (from grid computing), but not looking for when grids act like clouds
Enomalism now called ECP (startup with open source), difficult and pretty opaque
They’re building on Eucalyptus.cs.ucsb.edu: elastic utility computing architecture linking your programs to useful systems
- linux (like Amazon)
How to know if it’s a cloud? try things on it that you can do on amazon web services (like in the interface) and see if it can do these things (no strict definition of “cloud” just knew a cloud when they saw one, AWS).
- software overlay so it didn’t really monkey with the underlying infrastructure (with some grid things you had to blow away your operating system, install a whole bunch of libraries, and it was really difficult to know what it was doing and to support it)
A driver of this was to save money – researchers wanted to work in the commercial cloud, but it’s really expensive. If nothing else, they can use this to debug before moving into the commercial cloud.
Goal to more like democratize – not to replace Amazon and Google services at all – but to allow for other people to try things, but you won’t have their data centers, you’ll still need to have the hardware (and other things).
Interface is based on Amazon’s published WSDL, EC2 command-line tools, REST interface
sys admin – cloud admin, toolset for user accounting and cloud management
Security – they figured it out (WS-security, SSH key generation, etc.), but makes installation and sequencing a bit complicated – refer to his slides
Performance – if Xen is installed right, they haven’t seen the performance hits people are concerned about with virtualization
You can play with the Eucalyptus Public Cloud – with limitations
Benchmarking tests show it’s running very fast and responding like the amazon services
Part of VGrADS – an NSF project – Linked Environments for Atmospheric Discovery – real time use of Doppler radar for local/regional forecasting
… lots of details
clouds and grids are distinct
cloud
- full private cluster is provisioned
- individual can only get a tiny fraction of the total resource pool
- no support for cloud federation
- opaque wrt resources
grid
- built so that a single user can get most or use the entire cloud for a single project
- federation as first principle
- resources are exposed
Questions from the audience:
- teragrid – I guess currently a 3week queue? maybe use this to get projects in faster?
- SLAs how to actually monitor and enforce?
Labels: IEEEeScience08
Comps readings this week
McCain, K. W. (1990). Mapping authors in intellectual space: A technical overview.
Journal of the American Society for Information Science, 41(6), 433-443.
Highly recommended. Pretty short, straight to the point; the software references are of course dated, but not so much so that you can't figure out a current equivalent. Explains what the deal is with ACA (author co-citation analysis) - both the raw graph, how to get the authors, and then how to do similarity measures and display the information.
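A toy version of that pipeline in R (my made-up co-citation counts; Pearson correlations plus clustering and MDS is the classic ACA recipe, not necessarily McCain's exact software):
# toy author co-citation analysis (ACA)
authors <- c("A", "B", "C", "D")
cc <- matrix(c( 0, 30,  5,  2,
               30,  0,  8,  1,
                5,  8,  0, 20,
                2,  1, 20,  0),
             nrow = 4, dimnames = list(authors, authors))   # raw co-citation counts (invented)
diag(cc) <- apply(cc, 1, max)     # crude stand-in for the usual diagonal adjustment
sim <- cor(cc)                    # Pearson correlations as the similarity measure
d <- as.dist(1 - sim)             # similarity -> distance
plot(hclust(d))                   # cluster analysis...
mds <- cmdscale(d, k = 2)         # ...and a 2-D map via multidimensional scaling
plot(mds, type = "n"); text(mds, labels = authors)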
Glanzel, W., & Moed, H. F. (2002). Journal impact measures in bibliometric research. Scientometrics, 53(2), 171-193.
Pretty straight forward review. Easy to read. One thing that I believe but that surprises me (hadn't occurred to me before):
ISI classifies documents into types. In calculating the numerator of the IF, ISI counts citations to all types of documents, whereas as citable documents in the denominator ISI includes as a standard only normal articles, notes, and reviews. However, editorials, letters and several other types are cited rather frequently in a number of journals. (p.181)
Really?! Makes sense, but hmmm. Apparently really inflates Lancet ("real" IF would be 43% lower).
Also a nice discussion of issues with the IF, journal aging/productivity (cited half-life really isn't appropriate), some of the other options (and speculation about why they haven't caught on), and the point that you can't use normal distributions - you have to use Pareto or other skewed distributions (negative binomial, geometric, Poisson...) - to compare IFs.
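For reference, the standard two-year impact factor as ISI computes it - the asymmetry the quote is pointing at is that the numerator counts citations to everything, while the denominator only counts the "citable" items:
\[
\mathrm{IF}_{Y} = \frac{\text{citations received in year } Y \text{ to items the journal published in } Y-1 \text{ and } Y-2}{\text{articles, notes, and reviews the journal published in } Y-1 \text{ and } Y-2}
\]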
DeSanctis, G., & Poole, M. S. (1994). Capturing the complexity in advanced technology use: Adaptive structuration theory.
Organization Science, 5(2), 121-147.
Oh, the horror. I guess I just don't get it. Seemed to be covering the same ground as some of the social informatics pieces...
Borgman, C. L., & Furner, J. (2002). Scholarly communication and bibliometrics. In B. Cronin (Ed.), Annual Review of Information Science and Technology (ARIST) (pp. 2-72). Medford,NJ: Information Today. doi:10.1002/aris.1440360102
I think I've read parts of this on at least 3 other occasions, but it was important to read from end to end.
2 Chapters of Downey, G. L. (1998).
The machine in me: An anthropologist sits among computer engineers. New York: Routledge.
This warrants more discussion, but that will keep for its own post.
Slight slowdown here in the reading process as 1) life continually intrudes, sigh, and 2) slides are getting prepared (I spent a bunch of time trying to make prettier graph pictures based on some oblique criticism from a very well-known info viz guy, but oh well!)
Labels: comps
Some notes for Pajek > R > iGraph
These are mostly so I can find them again (argh)....
Install R, the igraph package, and Pajek.
In Pajek, "locate" R:
tools > R > locate R ... look in c: > program files > R > R-2.7.1 > bin > Rgui.exe (your mileage and R version may vary)
Load your data into Pajek (with a .net, under networks click on the folder)
Do whatever manipulations in Pajek that you want to do (drop isolates, it's much easier in Pajek)
Tools > R > Send to R > Current network
If you had a version of R open, it will open another, so there's no point in opening R first. When R opens, you'll have the information from Pajek:
######################################
R called from Pajek
http://vlado.fmf.uni-lj.si/pub/networks/pajek/
Vladimir Batagelj & Andrej Mrvar
University of Ljubljana, Slovenia
-----------------------------------------------------------------------
The following networks/matrices read:
n1 : C:/Documents and Settings/...(1182)
Use objects() to get list of available objects
Use comment(?) to get information about selected object
Use savevector(v?,'???.vec') to save vector to Pajek input file
Use savematrix(n?,'???.net') to save matrix to Pajek input file (.MAT)
savematrix(n?,'???.net',2) to request a 2-mode matrix (.MAT)
Use savenetwork(n?,'???.net') to save matrix to Pajek input file (.NET)
savenetwork(n?,'???.net',2) to request a 2-mode network (.NET)
Use v?<-loadvector('???.vec') to load vector(s) from Pajek input file
Use n?<-loadmatrix('???.mat') to load matrix from Pajek input file
-----------------------------------------------------------------------
It's actually called n1 (yes, they say this, but...)
load the igraph package (packages > load)
You now have to make your adjacency matrix into a "graph"
g <- graph.adjacency(n1, mode=c("directed"), weighted=NULL, diag=TRUE, add.colnames=NULL, add.rownames=NA)   # n1 is the matrix Pajek sent over; diag=TRUE keeps any self-loops
then you can sort of cookbook from the igraph page.... use their models and their in-progress tutorial book
Oh, and name everything you call - I just ran the community detection and forgot to put the name <- in front of it. Something like the sketch below:
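(a minimal sketch using current igraph function names; walktrap is just one detector - igraph also has spinglass.community, which matches the spin glass method mentioned in the blogs-paper post, but that one insists on a connected graph)
gu <- as.undirected(g, mode = "collapse")   # most community detectors want an undirected graph
comm <- walktrap.community(gu)              # name the result so you can reuse it!
memb <- membership(comm)                    # community id for each vertex
table(memb)                                 # sizes of the communities
V(gu)$color <- memb                         # color nodes by community
plot(gu)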
Labels: sna
4th IEEE eScience Conference: Detecting Communities in Science Blogs
(I'm duplicating my abstract here - feel free to ask questions before, during, or after my presentation using the comments. They are moderated, but I will post all that are not obviously spam)
For real time discussion, use the chat page:
http://live.escience2008.iu.edu:4321/bin/meeting.html#ID=23, 2008/12/10.
SCHEDULE HERE SAYS 1:30-2 American Eastern Standard Time (EST).
Detecting Communities in Science Blogs
Many scientists maintain blogs and participate in online communities through their blogs and other scientists’ blogs. This study used social network analysis methods to locate and describe online communities in science blogs. The structure of the science blogosphere was examined using links between blogs in blogrolls and in comments. By blogroll, the blogs are densely connected and cohesive subgroups are not easily found. Using spin glass community detection, six cohesive subgroups loosely corresponding to subject area were found. By commenter links, the blogs form into more easily findable general subject area or interest clusters.
UPDATED TIME
Labels: e-science